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Abstract 

Background: Characterizing the biogeography of the microbiome of healthy humans is essential for 
understanding microbial associated diseases. Previous studies mainly focused on a single body habitat from a 
limited set of subjects. Here, we analyzed one of the largest microbiome datasets to date and generated a 
biogeographical map that annotates the biodiversity, spatial relationships, and temporal stability of 22 habitats 
from 279 healthy humans. 

Results: We identified 929 genera from more than 24 million 16S rRNA gene sequences of 22 habitats, and we 
provide a baseline of inter-subject variation for healthy adults. The oral habitat has the most stable microbiota with 
the highest alpha diversity, while the skin and vaginal microbiota are less stable and show lower alpha diversity. 
The level of biodiversity in one habitat is independent of the biodiversity of other habitats in the same individual. 
The abundances of a given genus at a body site in which it dominates do not correlate with the abundances at 
body sites where it is not dominant. Additionally, we observed the human microbiota exhibit both cosmopolitan 
and endemic features. Finally, comparing datasets of different projects revealed a project-based clustering pattern, 
emphasizing the significance of standardization of metagenomic studies. 

Conclusions: The data presented here extend the definition of the human microbiome by providing a more 
complete and accurate picture of human microbiome biogeography, addressing questions best answered by a 
large dataset of subjects and body sites that are deeply sampled by sequencing. 
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Background 

Biogeography in traditional ecology attempts to describe 
spatial and temporal patterns of biological diversity of 
organisms [1]. It seeks to answer what organisms are pre- 
sent, how they are distributed, and how they vary over 
time. Biogeographical patterns allow one to explore what 
ecological rules apply, such as cosmopolitan (global) or 
endemic (confined to a location) distribution patterns of 
taxa. 

The human body is composed of many niches that are 
directly or indirectly exposed to the external environment 
and are modulated by interaction with multiple systems 
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such as the host immune, endocrine, and nervous systems. 
Each niche thus serves as a unique and complex space for 
microbes to survive and thrive. Alteration of the microbial 
balance in humans is associated with disease [2-12]. 
Characterization of the human microbial community by 
culture-independent, high-throughput sequencing 
unveiled the high diversity of microbes in different habitats 
of the body. The gut microbiome is the best-characterized 
human microbial habitat, and it was originally estimated 
to harbor 500 to 1,000 species [13]. Bacterial community 
structures in other habitats of the human body such as 
the oral, skin, nasal, and vaginal areas also have been 
addressed [14-21]. Over 500 species-level phylo types were 
discovered in oral habitats [16,22]. Two hundred and five 
genera were identified on the skin surface from 10 subjects 
[18]. Traditionally, urine has been thought to be sterile but 



o 



© 2013 Zhou et a!.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons 
BIoIVIGCI C6ntr3l Attribution License (http://creativecommons.0rg/licenses/by/2.O), which permits unrestricted use, distribution, and reproduction in 
any medium, provided the original work is properly cited. 



Zhou et al. Genome Biology 2013, 14:R1 
http://genomebiology.eom/content/1 4/1 /R1 



Page 2 of 1 8 



culture-independent sequencing uncovered diverse micro- 
bial communities and substantial intra- and inter-subject 
variability [7]. 

High variation of microbial composition between indi- 
viduals requires deeper sequencing and larger subject 
numbers to have a more complete view of human 
microbiome. The NIH launched the Human Micro- 
biome Project (HMP) [23], aiming to more fully charac- 
terize the human microbiota and address its role in 
health and disease. This project enrolled 300 healthy, 
young adults, collected samples from 15 (male) or 18 
(female) habitats in the body, and produced datasets of 
both 16S rRNA gene and whole genome shotgun 
(WGS) sequences [24]. The 16S rRNA gene sequences 
aim to define community structures in different habitats, 
representing major niches of the human body, including 
stool, oral, skin, nasal, and vaginal areas. The HMP con- 
sortium has been conducting extensive analysis with the 
HMP 16s rRNA gene dataset to explore different aspects 
of human microbiome [24-29]. RDP-based taxonomic, 
operational taxonomic unit (OTU), and phylogenetic 
approaches are the main analytic methods used to char- 
acterize microbial community structure [24,25,28]. The 
RDP taxonomic approach used in this study possesses 
easy interpretability and better sequence error tolerance. 
It provides confident taxonomic assignment at the 
genus and other higher taxonomic levels. The OTU 
approach offers a sub-genus level resolution. However, 
the OTU does not always reflect a biological unit, and 
technical factors (single, average, complete linkage) lar- 
gely affect the components within an OTU. The phylo- 
genetic approach dependent on the tree construction 
provides phylogeny of the bacterial community, but it 
inevitably bears the intrinsic problems of tree construc- 
tion using short reads. Each of these methods has its 
pros and cons, and they are complementary to each 
other. They provide equally important insight into the 
bacterial community structure [30]. 

As a companion paper of the HMP main papers, we 
presented an extensive taxonomic analysis of the major- 
ity of the HMP samples as well as additional samples 
comprising preterm baby stool, two urogenital sites, and 
the conjunctiva in adults, thus expanding the view of 
the microbiome beyond that of the HMP. We addressed 
the following aspects of human microbial biogeography 
in the context of a large cohort and deep sequencing: 
how many organisms inhabit the human body; is the 
biodiversity of one habitat influenced by other habitats; 
is the presence or abundance of the organism in one 
habitat affected by other habitats; what is the bacterial 
distribution pattern in a habitat; what is the degree of 
inter-personal variation and temporal variation in differ- 
ent habitats; are the HMP data comparable with the 
prior human microbiome studies. The analysis focusing 



on the above questions are not addressed in other HMP 
companion papers or are insufficiently investigated in 
the main papers. Answering these questions allows us to 
have a more complete and broad understanding of the 
ecosystems of the human body. 

Results 

Overview of the datasets 

Sample collection, DNA extraction, sequencing, as well 
as data processing followed the manual of procedures of 
the HMP consortium [24]. Based on investigations of 
mock communities and a pilot study of 24 subjects from 
the HMP, we found bacterial community structure, 
including composition and abundance, was not biased 
by the center performing sequencing and there was con- 
sistency of community structure between technical repli- 
cates [24,25,31]. 

To have a broader view of the human microbiome, we 
used datasets from the HMP and healthy controls from 
two HMP demonstration projects, necrotizing enteroco- 
litis (neonatal stool) and urethritis microbiome (urine 
and penis) as well as a study of the conjunctiva. All sub- 
jects were adults except for necrotizing enterocolitis 
(preterm babies) and urethral microbiome (adolescents). 
All samples were from the USA except conjunctiva (The 
Gambia). The description of each dataset is summarized 
in Table 1. 

Over 24 million, high-quality 16S rRNA gene sequences 
(generated on the Roche-454 Titanium FLX platform) 
were included in the analysis. All sequences were gener- 
ated from the V3-V5 variable regions of the 16S rRNA 
gene with the exception of conjunctiva (VI -V3, due to the 
better amplification of V1-V3 of conjunctiva samples). 
The reads were from 2,983 specimens, sampled from 236 
HMP healthy subjects at 15 (male) or 18 (female) body 
habitats, 18 subjects at urine and penis habitats, 10 pre- 
term babies at the stool habitat, and 15 subjects at the 
conjunctiva (Table 1). Raw reads were processed by filter- 
ing low quality reads and removing chimeras as described 
by the HMP [24,25,31]. The average read length of V3-V5 
was 468 ± 82 bp, and the average sequencing depth was 
8,167 reads ± 5,556 reads (mean ± sd). All reads were 
further classified by the RDP classifier [32] (version 2.2 
with training set 6) from phylum to genus level at the 50% 
confidence threshold. 

Biodiversity of human microbiota 

One of the goals of the HMP is to characterize the bac- 
terial composition and distribution pattern in and on 
the human body. In the combined extended datasets in 
the present analysis, we identified 30 phyla, 51 classes, 
125 orders, 493 families, and 929 genera from all body 
habitats (Table 1). In addition, a wide range of unclassi- 
fied organisms was present in each habitat (Figure SI in 



Table 1 Cumulative and average number of taxa per body habitat 
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Body habitat 


Project 


Subjects in) 


Reads 


Phylum 


Class 


Order 


Family 


Genus 


Reads (m ± sd) 


Phylum 


Class 


Order 


Family 


Genus 


Genus_1000^ 


Gut 

Stool 


HMP 


209 


1 774406 


13 


23 


38 


76 


203 


8490 ± 8308.1 


6 ± 


0.1 


10 ± 0.2 


12 ± 0.3 


25 ± 


0.5 


54 ± 1 .2 


32 


± 0.7 


Stool (preterm) 


NEC 


10 


84574 


6 


12 


21 


39 


90 


8457 ± 3505.8 


4 ± 


0.3 


5 ± 0.4 


7 ± 0.6 


12 ± 


1.1 


26 ± 2.5 


14 


± 0.7 


Nasal 












































Anterior nares 


HMP 


166 


1178682 


20 


38 


75 


165 


457 


7100 ± 4031.8 


7 + 


0.1 


12 ± 


0.2 


1 7 


+ 


0.3 


34 ± 


0.8 


53 ± 1.5 


27 


± 0.9 


Oral 












































Buccal mucosa 


HMP 


186 


1542387 


15 


25 


46 


99 


241 


8292 ± 5166.5 


7 ± 


0.1 


12 ± 


0.2 


1 7 


+ 


0.2 


33 ± 


0.5 


59 ± 1.1 


33 


± 0.7 


Hard palate 


HMP 


193 


1551365 


19 


34 


62 


127 


301 


8038 ± 4518.7 


7 + 


0.1 


13 ± 


0.2 


18 


+ 


0.3 


36 ± 


0.5 


63 ± 1 


39 


± 0.6 


Keratinized gingiva 


HMP 


199 


1553381 


13 


22 


37 


88 


218 


7806 ± 4591.9 


6 ± 


0.1 


11 ± 


0.1 


14 


± 


0.2 


25 ± 


0.4 


41 ± 0.9 


23 


± 0.5 


Palatine tonsil 


HMP 


199 


1648985 


15 


25 


47 


93 


242 


8286 ± 5068.5 


8 ± 


0.1 


14 ± 


0.2 


18 


+ 


0.3 


35 ± 


0.5 


62 ± 1 


40 


± 0.6 


Saliva 


HMP 


182 


1335119 


17 


29 


49 


99 


266 


7336 ± 4083.7 


9 ± 


0.1 


15 ± 


0.2 


20 


± 


0.3 


39 ± 


0.5 


72 ± 0.9 


50 


± 0.6 


Subgingival plaque 


HMP 


201 


1603380 


14 


23 


42 


84 


244 


7977 ± 4364.2 


8 ± 


0.1 


14 ± 


0.2 


19 


± 


0.2 


38 ± 


0.4 


68 ± 0.9 


46 


± 0.6 


Supragingival plaque 


HMP 


208 


1677580 


15 


24 


40 


77 


219 


8065 ± 4539.6 


8 ± 


0.1 


13 ± 


0.2 


17 


± 


0.3 


35 ± 


0.5 


62 ± 0.9 


41 


± 0.7 


Throat 


HMP 


186 


1455561 


18 


29 


53 


119 


322 


7826 ± 3784.6 


8 ± 


0.1 


13 ± 


0.2 


18 


± 


0.3 


36 ± 


0.5 


63 ± 1 


41 


± 0.6 


Tongue dorsum 


HMP 


206 


1 769242 


13 


21 


35 


74 


185 


8589 ± 7921.7 


7 + 


0.1 


13 ± 


0.2 


16 


± 


0.2 


32 ± 


0.4 


56 ± 0.8 


35 


± 0.5 


Skin 












































Antecubital fossa (left) 


HMP 


75 


470729 


21 


41 


78 


168 


501 


6276 ± 4981.6 


8 ± 


0.2 


13 ± 


0.4 


23 


± 


0.7 


48 ± 


1 .7 


81 ± 3.9 


56 


± 2.5 


Antecubital fossa (right) 


HMP 


87 


541278 


21 


42 


81 


186 


534 


6222 ± 5348.3 


8 ± 


0.2 


14 ± 


0.4 


23 


± 


0.8 


48 ± 


1.9 


80 ± 3.9 


52 


± 2.4 


Retroauricular crease (left) 


HMP 


181 


1762310 


20 


38 


75 


162 


474 


9737 ± 6158.1 


7 ± 


0.1 


11 ± 


0.2 


17 


± 


0.4 


31 ± 


1 


47 ± 1.9 


19 


± 1.0 


Retroauricular crease (right) 


HMP 


189 


1710301 


23 


42 


79 


173 


484 


9049 ± 4839.4 


6 ± 


0.1 


11 ± 


0.2 


16 


± 


0.4 


30 ± 


0.9 


44 ± 1 .7 


19 


± 1.0 


Ocular 












































Conjunctiva 


The Gambia 


15 


158577 


20 


37 


70 


165 


409 


10571 ± 10554. 


8 ± 


0.8 


14 ± 


0.9 


21 


± 


1.4 


37 ± 


2.8 


81 ± 8.7 


54 


± 5.7 


Urogenital 












































Mid vagina 


HMP 


89 


732322 


10 


19 


36 


75 


218 


8228 ± 6675.7 


5 ± 


0.1 


8 ± 0.2 


11 


± 


0.4 


20 ± 


0.8 


29 ± 1 .5 


12 


± 0.9 


Posterior fornix 


HMP 


89 


822111 


10 


18 


31 


70 


176 


9237 ± 7652 


5 ± 


0.1 


7 ± 0.2 


9 ± 0.4 


15 ± 


0.7 


20 ± 1 .2 


7 ± 0.6 


Vaginal introitus 


HMP 


80 


660867 


13 


20 


35 


75 


203 


8261 ± 4389.5 


5 ± 


0.1 


9 ± 0.2 


12 ± 0.4 


23 ± 


0.8 


34 ± 1 .5 


16 ± 0.9 


Urine 


Urethritis 


18 


152360 


23 


38 


64 


131 


312 


8464 ± 4694.9 


9 ± 


0.5 


14 ± 1.4 


24 ± 2.5 


49 ± 


5.6 


61 ± 5.2 


38 ± 4.8 


Penis 


Urethritis 


18 


203866 


20 


35 


64 


135 


333 


11326 ± 3523.4 


8 ± 


0.7 


14 ± 


1 


22 ± 1 .9 


42 ± 


3.3 


68 ± 5.5 


30 ± 3.0 


Total 


All projects 


279 


24385112 


30 


109 


125 


493 


929 


8173 ± 5556 


ND 




ND 




ND 




ND 




ND 


ND 



^1,000 reads were subsampled from each sample and the average number of genera is listed. 



Zhou et al. Genome Biology 2013, 14:R1 
http://genomebiology.eom/content/1 4/1 /R1 



Page 4 of 1 8 



Additional file 1), from which novel organisms at the 
genus level or higher rank were identified [33]. 

The 30 bacterial phyla observed represent less than half 
of the known bacterial phyla [34]. The human habitats 
contain a large number of Firmicutes, Actinobacteria, 
Proteobacteria, and Bacterioidetes (Figure S2A in Addi- 
tional file 1). Actinobacteria and Proteobacteria were the 
most predominant phyla in the marine and soil micro- 
biomes [35,36] whereas Firmicutes were less dominant, 
indicating a major difference between the external envir- 
onments, the presumed source of microbes, and the 
human body. Bacterial distributions at the phylum level 
exhibited different patterns in different habitats. The dif- 
ferences were revealed in both abundant phyla as well as 
low abundance phyla (<0.5% of the average relative abun- 
dance) (Figure S2A, S2B in Additional file 1). 

At the genus level we identified 501 and 534 genera 
from left and right antecubital fossa, respectively, and 
474 and 484 genera from left and right retroauricular 
crease, respectively, making these four skin sites the 
richest communities even though the number of sub- 
jects sequenced from skin habitats was less than oral 
and stool habitats. As indicated by the accumulation 
curves in Figure 1, skin and skin-associated sites con- 
tinue to yield more taxa with increasing numbers of 
subjects. On the other hand, the slopes of the accumula- 
tion curves for the three vaginal sites, seven of nine oral 
sites (except throat and hard palate), and the stool 
become asymptotically flatter, suggesting the sampling is 
close to saturation for those habitats. This indicates a 
lower richness of genera in these sites. 

The accumulation curves in Figure 1 represent the 
overall richness of each habitat, which is a function of 
both sequence depth and sample size. Accumulation 
curves based on subsampling down to 1,000 reads show 
similar patterns as using all of the reads, with skin being 
the richest community (Table 1, Figure S3A in Addi- 
tional file 1). 

Richness is one dimension of biodiversity. Shannon 
diversity is another diversity index that measures both the 
richness and evenness. To compare the differences of 
diversity between habitats, a t-test was performed for each 
combination of two different habitats. P values corrected 
by Bonferroni method are summarized in Table SI in 
Additional file 2. Oral sites, particularly saliva, have the 
highest evenness (Figure 2). Buccal mucosa and kerati- 
nized gingiva have lower diversity than the other seven 
oral sites. Posterior fornix of the vagina shows the lowest 
diversity. Stool and skin sites show moderate diversity. 
In contrast to the overall richness estimated at the 
genus level, the stool habitat had the highest species 
(OTU) level richness (with an OTU defined as 97% 
identity of sequences), followed by oral, skin, and vaginal 
habitats [28]. Stool, the best-studied habitat, was ranked 



second to last in terms of community richness at the 
genus level among all of the 18 HMP habitats. The rich- 
ness difference observed in genus and OTU level may 
be because: (1) the richness observed at genus level is 
database-dependent (RDP database); or (2) there are 
more unclassified genera in stool and oral sites than 
skin and vaginal sites (Figure SI in Additional file 1), 
and the unclassified genera may contain different OTUs. 

Universal distribution pattern of human microbiota 

The bacterial distribution patterns in all 22 human habi- 
tats follow a general rule, which is seen in other sys- 
tems: all communities are dominated by from one to 
several genera, and rank abundance curves (RAC) have 
a long tail of less abundant organisms (Figure 3). These 
minor organisms are detected after quality filtering and 
chimera removal and thus are unlikely to be due to 
these types of artifacts. Concern has been previously 
raised over the possibility that minor species/OTUs are 
nevertheless 'noise' due to various issues. One of these 
is sequencing artifacts that are not removed by quality- 
based filtering [37,38]. As noted earlier, our methods 
were developed using mock community approaches, in 
which a limited number of false taxa were detected [31]. 
Moreover, different from the studies at the species/OTU 
level, analysis at the genus level is also less likely to be 
as sensitive to false taxon calls due to small sequence 
errors. Thus we do not consider sequencing errors to be 
a major source of the minor organisms detected. 

As described previously, the best evidence for the gen- 
uine existence of a minor taxon is its appearance across 
many distinct samples [38]. If samples are truly distinct, 
the possibility of contamination as the source of minor 
taxa that are prevalent among samples is reduced. The 
HMP samples were collected separately in St. Louis and 
Houston and sequenced at four sequencing centers, and 
may thus be regarded as such distinct data sources. We 
note below that genera that are present at low abun- 
dances, for instance nasal Streptococcus have an average 
relative abundance of 2%, but are found in >80% of the 
nasal samples. Two-thirds of those samples with Strepto- 
coccus are from Houston, and sequenced roughly evenly 
at the four sequencing centers, making it unlikely to be 
noise from contaminants (Figure 4). 

There are other organisms that are present at low 
abundance but are found in few samples. To further 
explore the prevalence of these minor organisms, we 
chose 60 stool samples that had >9,000 reads and sub- 
sampled 1,000, 3,000, 6,000, and 9,000 reads from these 
samples. At each sequencing depth, the prevalence of 
each taxon was calculated (Figure S4 in Additional file 
1). With increasing sequencing depth, the prevalence 
among the subjects for the majority of the taxa at the 
1,000 read depth increased 3.6-fold on average. As an 
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Figure 1 Accumulation curves of 22 hiabitats. Each line represents the accumulative richness from all subjects. All the reads were included in 
the analysis to have a full view of the genera revealed by the HMP and other datasets. At the genus level, oral, vaginal, and stool habitats 
become asymptotically flat at the current sampling depth and sampling efforts. More subjects are needed to reach the saturation for skin and 
skin-associated habitats. 



example of a well-known minor genus, that is not a false 
taxon, we analyzed Escherichia/Shigella, and found it in 
five out of 60 subjects at 1,000 reads, and increased to 
24 subjects at 9,000 reads. As a second example, we 
chose Corynebacterium and Propionihacterium, which 
make up the majority of skin microbiota, but are not a 
major part of the fecal flora. They were each identified 
in one subject at 1,000 read depth and increased to 11 
and six subjects, respectively, at 9,000 reads. These 
examples show that there are likely real taxa among the 
low abundance genera that are not as prevalent among 
subjects. 



Diversity correlation between major habitats from the 
same subjects 

Many previous studies addressed the bacterial community 
biodiversity of a single habitat, thus biodiversity correla- 
tions between habitats were not described. To answer this 
question, we chose two different body sites from the same 
subjects and computed their richness. Correlation of body 
sites' richness within a subject was then expressed by the 
Spearman correlation coefficient. The richness correlation 
was 0.68 between subgingival plaque and supragingival 
plaque, 0.66 between left and right retroauricular crease, 
0.68 between left and right antecubital fossa, and 0.41 to 
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Figure 2 Shannon diversity comparisons of 22 habitats. One thousand reads were rarefied from each sample. The boxplot shows the 
minimum, 25th percentile, median, 75th percentile, and the maximum of the data from bottom to top. Oral habitats in general have more even 
bacterial communities, and vaginal habitats have the lowest Shannon diversity. 

J 



0.68 within vaginal sites, indicating a stronger correlation 
for those pairs. In contrast, correlations between skin and 
oral, skin and vaginal, or skin and stool sites were very 
low, indicating richness was not correlated (Figure 5A). 
Therefore, similar body sites had stronger correlation of 
richness, while dissimilar sites had little or no correlation. 
More specifically, an individual who harbors a greater col- 
lection of taxa in saliva is more likely to have more taxa 
on the tongue, but not necessarily more taxa in skin, stool, 
or vagina. It has been well known that major habitats are 
distinctive by their specific dominant taxa [25,39], and a 
finer distinction between similar habitats, such as oral 



sites, is achievable at the OTU level [28]. The taxa prefer- 
ence to certain habitats is generally explained by environ- 
mental selection. The richness specificity for the major 
habitats that we found here contributed to another inter- 
esting ecological observation. 

Beta diversity, as measured by the Bray-Curtis dissimi- 
larity index, also correlated well for similar body sites 
and weakly for dissimilar body sites (Figure 5B). That is, 
two individuals who have similar bacterial communities 
in left and right retroauricular creases show similar com- 
munities in their antecubital fossae. This result does not 
provide support for the possibility that host genotype is 
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Figure 3 Bacterial distribution patterns viewed by ranl< abundance curves. The genus distributions are illustrated by rank abundance 
curves. The x-axis represents the ranked genera from high to low. The y-axis shows the average relative abundance for a given genus. Twenty- 
two different line shapes and nine different colors represent the 22 habitats in this study. One or a few genera dominate each habitat with a 
long tail representing rare genera. This bacterial distribution pattern agrees with the species abundance pattern in other environments. 



the major modulator of microbial content since no indi- 
viduals w^ere observed MAith higher diversity in all body 
sites, as might be expected if a host mechanism control- 
ling microbiome diversity w^ere a major factor. How^ever, 
the sample size and other factors such as environmental 
determinants may obscure smaller effects of the host. 

Quantitative measurement of bacterial community 
variation 

Understanding the variation of bacterial communities 
between healthy subjects is a prerequisite to investigating 
the role of microbiota in health and disease. Previous stu- 
dies of feMAer subjects revealed high interpersonal variation 
of bacterial community structure. Because this study fea- 
tures more subjects and deep sequencing than previous 
studies, we are able to more accurately assess this varia- 
tion. Here, we evaluated the interpersonal variation at the 
single taxon level as M^ell as the w^hole community level. 

The variation of a single taxon in a habitat was measured 
by the range and quantiles of the taxon relative abundance. 
The quantiles of the top 20 most abundant genera in each 
site are summarized in Table S2 in Additional file 2, which 
provides a reference on bacterial variation of healthy 



subjects for future studies. The taxon variation in vaginal 
sites was evident by the wide range of Lactobacillus 
(0-100%). In particular, at 4,000 read depth, the Lactobacil- 
lus genus was not detected in some subjects, and was the 
only genus detected in other subjects. This large variation 
of dominant genera was also seen in four skin sites, ante- 
rior nares, and penis samples, and in stool samples where 
the relative abundance of Bacteroides ranged from 1.3% to 
98.2%. 

Oral sites showed a more even abundance of major 
organisms compared to skin and vaginal sites (Figure 2), 
and associated with this, the genera in oral habitats have 
a narrower range of abundances. For instance, the rela- 
tive abundance of Prevotella in saliva ranges from 2.3% 
to 47.4%. 

While variation of the abundance of a single taxon 
between subjects is important, the variation of the abun- 
dance of the whole bacterial composition is another 
important measure. The HMP consortium not only pro- 
duced a large amount of sequence data, but it also 
developed new analytical tools to cope with the high 
dimensional metagenomic data [40,41]. Given the multi- 
variate nature of the metagenomic data, we recently 
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developed multivariate statistical method, Dirichlet- 
multinomial distribution, to model the variation of the 
whole bacterial community composition in a habitat 
[41]. In the model, theta is an overdispersion parameter 
reflecting the variance of each taxon and covariance 
between taxa. It ranges from 0 to 1, with the higher 
value representing higher variation (Figure 6). Lowest 
variation (theta <0.1) was observed for the nine oral 
sites and stool samples. Two of the three habitats show- 
ing highest variation were vaginal sites. Although these 
sites had the lowest alpha diversity, due to the domi- 
nance of Lactobacillus in most samples, the high over- 
dispersion resulted from the subset of samples with low 



Lactobacillus abundance. Skin habitats also showed 
higher overdispersion presumably due to their exposure 
to the environment. High variability was also observed 
for urine and preterm baby stool samples as expected 
since neither of these habitats shows consistent patterns 
of organisms between subjects. 

The numbers of organisms that we detected in samples 
were largely influenced by the sequence depth and num- 
ber of sampled individuals (Figure S3B in Additional file 
1). However, community variation was not sensitive to 
sequencing depth. Community variation measured by 
theta in DM model at 1,000, 3,000, 6,000, and 9,000 read 
depths was not significantly different {P >0.05) (Figure 6). 
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Figure 5 Biodiversity correlation between habitats. The symmetric plot was used to show the association of biodiversity between paired 
habitats. The size and the redness of the circle represent the degree of correlation. Large size and deep redness indicate strong correlation. The 
large red circles on the diagonal line represent self-comparisons. Proximal habitats have similar alpha and beta diversity. (A) Alpha diversity 
association. The association of bacterial richness of different habitats from the same individuals is expressed by the Spearman correlation 
coefficient. (B) Beta diversity association. Mantel correlation was used to compare the Bray-Curtis dissimilarity matrix. 



Cosmopolitan and endemic features of human microbiota 

In microbial ecology, the abundant species are found in 
many samples, that is, show high prevalence. Similarly, 
highly abundant genera in the human microbiota were 
present in higher percentage of subjects, whereas low 
abundance genera were restricted in distribution. This 
phenomenon is shown for the anterior nares (Figure 4) 
and the other body sites (Figure S5A, 5B, 5C in Additional 
file 1). Propionibacterium and Staphylococcus are present 
in all anterior nares samples with the abundance of 23% ± 
2.6% and %19 ± 3.0% (m ± sd), respectively. The majority 
of genera (92%) in the anterior nares are <1% abundance. 
However, we do not have evidence to suggest that they are 
noise within the sequence dataset as described above. 
Interestingly, some of the relatively low abundance genera 
are also widely distributed across subjects. For example. 
Streptococcus and Anaerococcus are both present in >80% 
of the nasal samples, however, their relative abundances 
are about 2% ± 0.8 and 2% ± 0.3 on average. This observa- 
tion was also evident in stool and oral sites. Coprococcus 
in stool, Fusobacterium in throat and hard palate, and 
Haemophilus in hard palate are all low abundance and 
high prevalence. As a result, the cosmopolitan aspect of 
human microbiota is not limited to high abundant taxa. 



On the other hand, a small group of subjects can con- 
tain relatively highly abundant genera that are unique to 
those subjects. For example, the genus Moraxella was 
present in 17.5% of the anterior nares samples with rela- 
tive abundance of 21.4% ± 3.8 (Figure 4). Prevotella 
identified from stool was highly represented in 39% of 
the samples (10.8% ± 2.3 of relative abundance) (Figure 
S5 Additional file 1). The presence of high abundance 
genera in a subset of the cohort indicates the endemic 
aspect of human microbiota. 

Analyzing the taxa distribution in a whole body view, we 
found that there were 39 genera present in all the 18 habi- 
tats of the HMP data in at least one subject (Figure 7). 
Extending this beyond the HMP habitats, and using this 
same criterion, we found that 12 genera are present in at 
least one subject of each of the full set of 22 habitats. These 
12 genera are found in disparate body sites in samples ran- 
ging from the USA to Africa, indicating the extreme cos- 
mopolitan nature of the organisms of these genera. In 
addition to the intercontinental distribution, these genera 
were identified in preterm baby, adolescent, and adult sam- 
ples, suggesting their ubiquity is not limited by age. 

There is no single genus present in all habitats of all 
subjects. Streptococcus is the most widespread genus 
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across the whole body with relatively high prevalence. It 
was found in all the subjects in the oral habitats, nearly 
all of the skin sites and anterior nares samples, about 
50% of vaginal samples, and 30% of stool samples. To 
investigate whether a locally abundant genus affects its 
abundance in less dominant sites, the abundance of 
Streptococcus from throat, Lactobacillus from posterior 
fornix, Bacteroides from stool were compared with their 
abundances in the rest of habitats (Figure S6 in Addi- 
tional file 1). The abundance of Streptococcus in throat 
has moderate correlation with the abundance of Strepto- 
coccus in other oral habitats except subgingival and 
supragingval plaques, and it has no correlation with 
non-oral sites. Similarly, no correlation was found 
between stool and non-stool site for Bacteroides, vaginal 
and non-vaginal sites for Lactobacillus, The strong asso- 
ciation of microbes within habitats and lack of associa- 
tion between major habitats is consistent with the 
findings from another HMP companion paper focusing 
on the microbial co-occurrence in human bodies [26]. 



However, both of our conclusions were based on the 
genus level analysis, while species and strain level analy- 
sis may reveal a different picture of the correlation pat- 
tern within habitat since the same genus from two 
different habitats may represent different species. 

Temporal variation of bacterial community 

To gain insight into temporal changes of the human 
microbiome in the 18 HMP body habitats, we evaluated 
the community similarity between two visits by Spear- 
man correlation (Figure 8). The mean time interval 
between visits is 212 days (Table S3 in Additional file 2). 
Bacterial communities in oral and stool habitats have 
strong correlation (>0.6) between first and second visits, 
thus representing stable communities, whereas skin/ 
nares and vaginal sites have weak correlation (<0.4) 
between visits, thus representing variable communities. 
Furthermore, within a body site, correlation between vis- 
its varies greatly from person to person, especially in skin 
and vaginal sites. 
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Figure 7 Taxa distribution across habitats. Thirty-nine genera present in at least one subject for all 18 HMP habitats are plotted. Each symbol 
represents a habitat. The y-axis shows the sample prevalence of the genera calculated as the number of samples who harbor the genus divided 
by the total number of subjects. Using this criterion, 12 of these genera are spread across all 22 habitats. 



Consistent with previous studies on the dynamics of 
the vaginal microbiota [42], we also detected a signifi- 
cant change in the relative abundance of Lactobacillus 
between visits in 11 out 29 repeated samples (Figure S7 
in Additional file 1). A decrease of Lactobacillus was 
accompanied by an increase in the bacterial vaginosis- 
associated genera such as Gardnerella, Prevotella, and 
Atopobium, 

We also observed drastic variation in bacterial com- 
munities of the anterior nares between visits. Moraxella 
is a genus that colonizes mucosal surfaces and can give 
rise to opportunistic infections. It is also one of the 
notable causes of otitis media and sinusitis. In this data- 
set, Moraxella was detected in 17.5% of the sampled 
individuals. High variation was detected in nine out of 
48 paired samples. For example, in one subject Morax- 
ella was 74.3% in abundance in the first visit, but 
dropped to 0% in the second visit while another subject 
showed 0% in first visit and 66.2% in second visit 
(Figure S8 in Additional file 1). The increased abun- 
dance of Moraxella was accompanied by decreasing 
abundance in other genera, such as Staphylococcus, 
Propionibacterium, and Corynebacterium, 

The time interval between two sampling points ranged 
from 30 days to 359 days and we additionally examined the 
correlation of the time interval and bacterial community 
variability. No strong correlation between time interval and 
the degree of bacterial variability was found (Table S3 in 



Additional file 2). A weak negative correlation was found 
in left retroauricular crease and throat. 

Comparison of HMP data with other human microbial 
data 

The HMP seeks to define the baseline of a healthy micro- 
biome. The large datasets produced are potentially of use 
as healthy controls for other studies. To investigate the uti- 
lity of HMP data for other studies, we examined the com- 
parability of HMP data with published data. We used the 
data from a study of the microbiota of lean and obese 
twins (154 subjects) and a Chinese saliva study of the 
microbiota in dental caries (45 subjects) (Table S4 in Addi- 
tional file 2) [43,44]. Reads were reprocessed as had been 
done for the HMP dataset to exclude bias caused by analy- 
sis pipelines. All samples were subsampled to 1,000 reads. 

The taxa identified from different projects were rela- 
tively consistent. For instance, 72% of the prevalent gen- 
era (defined as present in at least 50% of the samples) 
are present in both HMP saliva samples and Chinese 
saliva samples. Nine out of 12 phyla identified in HMP 
stool samples were also identified by the twin study at 
1,000 read depth. The three phyla unique to HMP stool 
samples were present in only 4, 14, and 16 reads in total 
and thus were quite minor and possibly absent due to 
sampling rather than biological issues. 

Although different studies detected similar prevalent 
or abundant taxa, the frequency of taxa varied among 
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Figure 8 Community stability over time. The longitudinal studies were based on 18 habitats from HMP. The similarity of the bacterial 
communities of each subject between two sampling points was evaluated by Spearman correlation coefficient (y-axis). Oral habitats and stool 
showed higher correlation between visits; skin and vaginal habitats showed lower correlation between visits. The variation of community stability 
between subjects is especially high in the skin habitats. 



studies. The frequency of Bacteroidetes from HMP data 
is significantly higher than that in the twin study (Wil- 
coxon rank sum test, P <0.001) (Figure S9 in Additional 
file 1). Hierarchical clustering of stool samples using the 
Bray-Curtis dissimilarity was performed as a general 
comparison. Surprisingly, stool clustering revealed two 
distinct groups (Figure SlOA in Additional file 1) with 
each cluster based on projects. The HMP saliva samples 
and the Chinese saliva samples were separated into two 
project-based clusters with the exception of six caries 
and two healthy Chinese saliva samples that clustered 
with HMP healthy samples. Two Chinese caries samples 



were outliers (Figure SlOB in Additional file 1). The 
clear boundary separating samples by projects suggest 
significant variation between the bacterial communities 
in each project. Multiple factors differ between projects 
(for example, read length, DNA extraction, 16S rRNA 
region, see Table S4 in Additional file 2), and these in 
addition to individual variation, demographic difference, 
and other technical issues [45] can have significant 
impact on bacterial community structure. The above 
analyses indicate the challenges for comparisons across 
projects and suggest that care must be taken in use of 
HMP data as healthy controls for other projects. 



Zhou et al. Genome Biology 2013, 14:R1 
http://genomebiology.eom/content/1 4/1 /R1 



Page 13 of 18 



Discussion 

Using the large 16s rRNA gene dataset generated by 
HMP and other projects, we analyzed the biogeographic 
patterns of human microbiota from 22 different habitats. 

The biodiversity by the 22 habitats representing the 
human body has implications in human microbial ecology. 
Each habitat has its own characteristic biodiversity and 
taxon abundance distribution curve. While there are many 
contributing factors to each characteristic pattern, this 
description of the healthy state will contribute to recogni- 
tion of changes associated with disease. The trend of the 
biodiversity changes varies with diseases. For example, 
diversity was reduced in stool habitat with obesity [43], 
but increased in the vaginal habitat with vaginosis [46]. 
However, it is not clear whether the disturbed biodiversity 
is the consequence or the cause of disease. Moreover, 
compared to classical macro-ecological systems, where 
healthy ecosystems show high biodiversity, considered 
important for stability and surviving stress to the system, 
it is apparent that the situation with the human micro- 
biome is more complex as there is no simple rule relating 
biodiversity to health. Thus for use of biodiversity as a 
diagnostic of health or disease, careful analysis of the base- 
line healthy state in each habitat is required. 

Despite the community variation with different habi- 
tats, the structures of these communities in human habi- 
tats adhere to general ecological rules. For example, most 
habitats have a few dominant taxa with a large number of 
individually rare species; the more abundant taxa show 
the most dispersal among subjects, although some minor 
taxa are widespread too; habitats showing lower Shannon 
diversity (vagina, skin) also are among the most variable. 
It was reported that the vaginal bacterial communities 
from healthy adults underwent drastic changes over a 
short period of time [21]. In this study, we also observed 
the dynamic community structure of the anterior nares. 
Clinical studies have revealed that the prevalence of phar- 
yngeal colonization and respiratory tract infections 
caused by Moraxella catarrhalis displays seasonal varia- 
tion and increases in winter [47]. Unfortunately, no 
detailed seasonal information is available for this dataset. 
Further study with multiple sampling points at different 
seasons will provide a better picture of the seasonal effect 
on the bacterial community in the anterior nares. In gen- 
eral, familiar concepts from macro-ecology appear to 
apply. Both cosmopolitan and endemic organisms are 
found in the human microbiome, with many organisms 
found in multiple habitats, albeit at varying abundance, 
so the adage 'everything is everywhere, but environment 
selects' [48] in ecology appears to hold for the human 
microbiome as well. 

Distinguishing rare organisms from noise is challen- 
ging. Although methods for denoising data to remove 
reads with sequencing errors is useful, even after quality 



filtering and chimera removal there are still artifacts to 
be addressed such as contamination. Tests based on 
increasing sequencing depth or prevalence in samples 
partially address artifacts but still leave aspects unre- 
solved. It should be noted that the presence of Propi- 
nionbacterium and Cory neb acterium in stool samples is 
uncertain for these reasons. More carefully designed 
controls for multiple sample processing steps are impor- 
tant for more definitive studies on rare organisms. 

The goal of defining the normal human bacterial biogeo- 
graphy is to be able to identify specific associations of 
altered bacterial biogeography with human disease. The 
high level inter-personal variation of bacterial commu- 
nities has significant impact for study designs. Host geno- 
type, effects due to viruses, sub-genus, and sub-species 
level genetic variation, functional equivalence of multiple 
taxa, and other biological effects all contribute to this var- 
iation and will need to be measured and/or controlled in 
studies seeking disease correlations. In addition, efforts are 
required to carefully select subjects to preclude bias and 
impose universal protocols of metagenomic sample pre- 
paration and sequencing, as exemplified by the HMP. 
Metagenomic DNA storage, preparation [45], sequencing, 
and analysis pipelines vary with different projects, which 
complicates the comparison between projects, as seen 
here in comparisons of HMP data with twin study data as 
well as with Chinese saliva data. 

Despite the scale of the dataset in terms of numbers 
of sequences, the total number of study subjects is still 
relatively small in epidemiological terms, albeit larger 
than most previous studies. Efforts to define the appro- 
priate study design to obtain statistical power for con- 
clusions as to effects at different taxonomic levels are 
underway but are not yet complete. Thus the observa- 
tions reported here should be taken with this caveat in 
mind. 

Finally, we note that the approach presented here pro- 
vides an organismal picture of who' comprises the human 
microbiome, potentially useful for recognizing deviations 
in disease. However, the 16S rRNA gene method involves 
PCR amplification with degenerate primers that may 
nevertheless favor some taxa over others. This method also 
creates chimeric sequences that are largely but not comple- 
tely removed during data processing. With the decrease of 
sequencing cost, sequencing large cohorts of samples by 
WGS is feasible in the future. The taxonomic profiling 
derived from WGS data (that is, MetaPhlAn) will over- 
come the above concerns [49]. Furthermore, the taxonomic 
approach using either 16S rRNA gene or WGS sequences 
must be complemented by studies of the individual organ- 
isms to understand what' functional activities they are pro- 
viding, necessary for a mechanistic understanding of the 
functioning of the microbiome. The pure focus on the 
organisms of many studies of the human microbiome 
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leaves this analysis incomplete. As aptly stated by early 
researchers in the field It is what bacteria do rather than 
what they are that commands attention, since our interest 
centers in the host rather than the parasite' [50]. 

Conclusions 

We comprehensively addressed the bacterial biogeography 
of multiple habitats from a cohort of 279 healthy subjects. 
What emerges is the beginning of a detailed picture of the 
character of each of these habitats with respect to aggre- 
gate ecological properties such as biodiversity as well as the 
specifics of their bacterial compositions. Each of these habi- 
tats was also analyzed with respect to variation between 
subjects, variation over time, and correlations between the 
habitats. While the biogeographic patterns of the human 
microbiome have been establishing, future investigations 
should turn the focus from observational study to hypoth- 
esis driven study. By testing the current and novel biogeo- 
graphic theories, we will better understand the underlying 
mechanisms that govern the observed biogeographic 
patterns. This in turn will enable us to manipulate our 
microbiota in clinic for medical benefit. 

Materials and methods 

Ethics statement 

Subjects provided written informed consent for screening, 
enrollment, and specimen collection. The HMP protocol 
entitled 'HMP-07-001 Human Microbiome Project - Core 
Microbiome Sample Protocol A' was reviewed and 
approved by Institutional Review Boards at Washington 
University in St. Louis, IRB ID#: 201105198 (previously 

08- 0754) and Baylor College of Medicine, IRB ID#: H- 
22895. The preterm infant study 'The Neonatal Micro- 
biome and Necrotizing Enterocolitis' was reviewed and 
approved by the Institutional Review Board at Washington 
University in St. Louis, HRPO #: 201104267 (previously 

09- 0652). Parents of infants provided informed consent 
under HRPO #: 09-0652. The penis samples were obtained 
as part of the project 'The Young Men's Project' reviewed 
and approved by the Indiana University Institutional 
Review Board - 1011004291 (previously 0906-17). The 
conjunctival samples were collected with permission from 
the 'Gambian Government/Medical Research Council 
Unit, The Gambia Joint Ethics Committee' under study 
number L2011.16. The projects adhered to the regulations 
of the boards for all experiments and research was con- 
ducted according to the principles expressed in the 
Declaration of Helsinki. Data were analyzed without per- 
sonal identifiers. 

Sample collection 

HMP samples were collected by teams at the Baylor 
College of Medicine and Washington University in St. 
Louis based on a sampling strategy developed by HMP. 



Core Microbiome Sampling Protocol A (HMP-A) was 
provided by the NIH Roadmap for Medical Research. In 
brief, 236 healthy adults were included in the HMP ana- 
lysis. Fifteen habitats comprised of anterior nares, skin 
(left and right retroauricular crease, left and right ante- 
cubital fossa), oral (hard palate, keratinized gingiva, buc- 
cal mucosa, subgingval plaque, supragingval plaque, 
saliva, tongue dorsum, palatine tonsil, and throat), and 
stool were sampled from both male and female subjects. 
Female subjects were sampled at three extra sites, vagi- 
nal introitus, posterior fornix, and mid vagina. Subjects 
with vaginal pH >4.5 were excluded from the study. For 
longitudinal studies, sets of samples from each habitat 
were collected at two time points (Table S3 in Addi- 
tional file 2). The mean time interval between the two 
time points was 212 days. We also included some data- 
sets from other projects. The control group from a tra- 
choma project with collaborators from The Gambia 
showed no evidence of acute or chronic trachoma and 
had normal healthy conjunctiva. Those samples provide 
a view of microbiota in the conjunctiva. Preterm baby 
stool samples from a neonatal microbiome project [51] 
contribute to the more complete picture of stool micro- 
biome. Microbiota vary greatly in the early weeks of life, 
thus a single sampling point was chosen at the age of 
4 weeks. No antibiotics were taken 3 weeks before sam- 
pling of the preterm baby stool. The urine habitat from 
healthy controls from the Urethral Microbiome of Ado- 
lescent Males project were added to represent other 
body habitats that were not represented in the HMP 
samples [7]. The penis microbiome data was also from 
the Urethral Microbiome of Adolescent Males project. 
Urine specimens were tested for C. trachomatis and 
N. gonorrhoeae infection. 

16S rRNA gene sequencing process 

Sequencing data were produced by the Baylor College of 
Medicine Human Genome Sequencing Center, the Broad 
Institute, the Genome Center at Washington University, 
and the J. Craig Ventor Institute. The quality filtering and 
trimming, chimera checking were performed as described 
[24]. In brief, the protocol allows one mismatch in the pri- 
mer and zero mismatches in the barcode. Chimeric reads 
were removed using Chimera slayer software [52] . At the 
beginning of the HMP, investigation on the Mock Com- 
munity that is composed of known organisms demon- 
strated that the increased richness was mainly from the 
chimeric and low quality reads [31]. All the high quality 
16S rRNA gene reads were classified from phylum to 
genus level at a confidence threshold of 0.5 using the 
Ribosomal Database Project (RDP) Naiive Bayesian Classi- 
fier version 2.2, training set 6 [32]. The reads whose taxo- 
nomic assignments were <0.5 confidence threshold were 
assigned to the unclassified group. Samples with < 1,000 
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reads were removed to ensure adequate representation of 
the community structure. The reads used in this analysis 
can be downloaded from the Data Analysis and Coordina- 
tion Center (DACC) website [55] with SRA study ID 
SRP002395. The metadata of HMP study was submitted 
by the EMMES Corporation, which serves as the clinical 
data collection site for the HMP. It can be obtained from 
dbGaP website [56]. 

Reads subsampling 

Read subsampling is done using the rarefy function from 
the Vegan package in R [53]. The random sampling is 
done without replacement so that the variance of rarefied 
communities is related to the rarefaction proportion rather 
than to the size of the sample. To avoid the bias that is 
potentially caused by different sequencing depths, all the 
samples were rarefied to 1,000 reads in Figures 3, 5, 7, and 
8 and additional files involved comparison across sites. 
The minimal number of reads among all the samples were 
used for the analysis of the taxa abundance association 
between habitats. The taxa relative abundances were cal- 
culated as the reads assigned to the taxa divided by the 
total number of reads (after subsampling) of the sample. 

Accumulation curve 

Accumulation curves were used to evaluate the total num- 
ber of genera in a body habitat and performed using the 
Vegan Package in R [53]. As mentioned in the read sub- 
sampling method, to compare the richness of 22 habitats, 
1,000 reads (minimal reads for the sample) were sub- 
sampled from each sample for the accumulation curve 
analysis (Figure S3A in Additional file 1). To assess the 
overall richness of the 22 habitats, the maximum number 
of reads for each sample was used (Figure 1). For both 
accumulation curves, the number of genera was summed 
stepwise with the accumulation of subjects. Only new gen- 
era were added in each step. This process was repeated 
500 times randomizing the choice of samples. The average 
value for each sample point was plotted. 

Bacterial distribution pattern viewed by rank abundance 
curve 

Rank abundance curve analysis was conducted by aggre- 
gation of the reads from all samples per habitat, and 
further normalized to relative abundance. The genera 
were ranked from highest abundance to the lowest 
abundance along the x-axis and the corresponding rela- 
tive abundances were plotted on the y-axis. Analysis was 
conducted by the Biodiversity package in R [54]. 

Measuring community variation by the overdispersion 
parameter of DM model 

To measure single taxon distribution in a given habitat, 
we computed the range and percentiles of its frequency 



across all the samples. The overall subject variation was 
represented by the average variation of all the taxa in a 
body site and evaluated by the overdispersion parameter 
0 in the Dirichlet-multinomial (DM) model. Human 
metagenomic data with high inter-subject variation fol- 
low Dirichlet-multinomial (DM) distribution as shown 
by us and other independent study recently [40,41]. It is 
defined as: 

, w nHinSih(i-^) + (r-i)^} 

P (Xj = Xi) = ■ 

Xii!...XiK! Y\r:iC^-0) + ir-l)0 

where Xik, is the number of reads in subject i for 
taxon fe, for i = 1, . . . ,P with P the total number of sub- 
jects and k = 1, . . . ,K with K the total number of taxa, 

EK 
Xij is the total number of sequence reads, 
j=i 

7t = [jTj,] = 1, . . . ,K] is the vector containing the mean 
of the taxa-frequencies, and 0 is an overdispersion para- 
meter. The vector n provides the average distribution 
of taxa frequencies across subjects, and the parameter 0 
quantifies the variation of taxa frequency. 0 ranges 
from 0 to 1, if ^ = 0 it reduces to the multinomial dis- 
tribution, which means that every sample has the same 
taxa frequency distribution. ^ is a suitable parameter to 
quantify the variation of subjects as measured by the 
taxa abundance distributions. Both n and 0 can be 
estimated directly from the data using either the meth- 
ods of moments or the maximum likelihood estimation 
method. The above analysis was performed using the R 
package from [41]. 

Diversity comparison 

To compare the diversity between habitats, we com- 
puted Richness and Shannon index for each individual 
in the habitats. For 22 habitats, a two sample t-test was 
performed on paired habitats. The P value was adjusted 
by the Bonferroni method. 

Diversity correlation 

Diversity correlations between habitats were performed 
using the subjects who have samples for both habitats. 
Due to the purpose of comparison of richness diversity 
between habitats, samples were subsampled to the mini- 
mal number of reads of the two habitats to be compared 
by random sampling [53]. The correlation of richness 
diversity of paired habitats was assessed by Spearman 
correlation coefficient. Bray-Curtis dissimilarity matrix 
was calculated for each habitat using the same abun- 
dance matrix generated for evaluation of richness diver- 
sity correlation. Mantel correlation was used to compare 
the paired distance matrices and assess the beta diversity 
association. 
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Evaluation of community similarity between two visits 

Samples with both visit 1 and visit 2 data were used to 
investigate the temporal variation of the bacterial com- 
munity for all of the 18 habitats. The similarity of the 
bacterial community between visits was evaluated by 
Spearman correlation coefficient. 

Additional material 



subjects with dental caries. The majority of the samples are clustered by 
project rather than health status. 

Additional file 2: Table 51. P values of the pair-wised student's t-test 
of richness (sheet 1) and Shannon diversity (sheet 2) for 22 habitats. 
Table S2: The variation of the top 20 most abundant phyla (sheet 1) 
and genera (sheet 2). The range and quartiles of each taxon's relative 
abundance within a habitat are listed. Table S3. Correlation of time 
interval and bacterial community variation between visits. The time 
intervals between visits for 18 body sites are listed. Temporal stability of 
bacterial community varies with habitats, indicated by the Spearman 
correlation coefficient in column 5. There is no strong correlation between 
the time interval and bacterial variation between visits as shown by the 
Spearman correlation coefficient in column 7. There is a weak correlation 
for left retroauricular crease and tongue. Table S4. Comparison of the 
HMP data with other datasets. 
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